Lockheed Electronics Company, Inc.

A SUBSIDIARY OF LOCKHEED CORPORATION

1830 NASA Road 1, Houston, Texas 77058
Tel. 713-333-5411


JSC-16238
DEC 13 1979

Ref: 643-7863 
Job Order 73-302 
Contract NAS 9-15800 


TECHNICAL MEMORANDUM

CLUSTERING ALGORITHM EVALUATION AND THE 
DEVELOPMENT OF A REPLACEMENT FOR 
PROCEDURE 1 


By 

R. K. Lennington and J. K. Johnson 


Approved By:

T. C. Minter

Supervisor 


Techniques Development Section 


(NASA-CR-160424) CLUSTERING ALGORITHM
EVALUATION AND THE DEVELOPMENT OF A
REPLACEMENT FOR PROCEDURE 1 (Lockheed
Electronics Co.) 80 p HC A05/MF A01

CSCL 09B G3/61 


N80-14772 


Unclas 

46522 




1. Report No.
   JSC-16232

2. Government Accession No.

3. Recipient's Catalog No.

4. Title and Subtitle
   CLUSTERING ALGORITHM EVALUATION AND THE DEVELOPMENT OF A
   REPLACEMENT FOR PROCEDURE 1

5. Report Date
   November 1979

6. Performing Organization Code

7. Author(s)
   R. K. Lennington and J. K. Johnson
   Lockheed Electronics Company, Inc.

8. Performing Organization Report No.
   LEC-13945

9. Performing Organization Name and Address
   Lockheed Electronics Company, Inc.
   Systems and Services Division
   1830 NASA Road 1
   Houston, Texas 77058

10. Work Unit No.

11. Contract or Grant No.
    NAS 9-15800

12. Sponsoring Agency Name and Address
    National Aeronautics and Space Administration
    Lyndon B. Johnson Space Center
    Houston, Texas 77058
    Technical Monitor:

13. Type of Report and Period Covered
    Technical Memorandum

14. Sponsoring Agency Code

15. Supplementary Notes


16. Abstract

This study was designed as a response to observed deficiencies in Procedure 1. A more
efficient procedure would be to simply cluster the data using a completely unsupervised 
clustering algorithm and then use labeled pixels to either label the resulting clusters 
directly or to perform a stratified estimate using the clusters as the strata. 

In the new procedure, clustering is the primary machine processing step, and the most
efficient clustering algorithm available was needed. Three algorithms, CLASSY, AMOEBA,
and Iterative Self-Organizing Clustering System (ISOCLS), were chosen for testing.

An equally important part of defining a new proportion estimation procedure was the
selection of a scheme for obtaining a stratified estimate or a method of labeling each
cluster. Three stratified estimation schemes and three labeling schemes were considered.

The evaluation and comparison of the algorithms and the six techniques for proportion 
estimation are documented in this report with recommendations. 


17. Key Words (Suggested by Author(s))
clustering algorithms
stratified proportion estimation
cluster labeling
CLASSY, AMOEBA, and ISOCLS


19. Security Classif. (of this report)
    Unclassified

20. Security Classif. (of this page)
    Unclassified

21. No. of Pages
    79

*For sale by the National Technical Information Service, Springfield, Virginia 22161

JSC Form 1424 (Rev Nov 75)




CONTENTS 


Section

1. BACKGROUND AND INTRODUCTION

2. CLUSTERING ALGORITHMS AND EVALUATION CRITERIA

3. TECHNIQUES FOR CLUSTER-BASED PROPORTION ESTIMATION

4. DATA SET AND EXPERIMENTAL DESIGN

5. RESULTS

6. CONCLUSIONS AND RECOMMENDATIONS

7. REFERENCES

Appendix

CALCULATION RESULTS OF THE AVERAGE BIAS IN THE
PROPORTION ESTIMATE, THE MEAN-SQUARE ERROR OF
THE ESTIMATE, AND THE VARIANCE REDUCTION FACTOR
AS COMPARED TO SIMPLE RANDOM SAMPLING




TABLES 


Table

2-1 MPAD CLUSTER PARAMETER SET

4-1 DESCRIPTION OF THE TWENTY-ONE SEGMENTS USED
IN THE STUDY

5-1 PCC VALUES USING MAJORITY-RULE LABELING AND R VALUES
FOR CLASSY, AMOEBA, AND ISOCLS

5-2 MAJORITY-RULE LABELING USING PROPORTIONAL ALLOCATION
RESULTS FOR FIVE SEGMENTS

5-3 MAJORITY-RULE LABELING USING SEQUENTIAL ALLOCATION
RESULTS FOR FIVE SEGMENTS, THREE-PIXEL PER CLUSTER
INITIAL ALLOCATION

5-4 MAJORITY-RULE LABELING USING SEQUENTIAL ALLOCATION
RESULTS FOR FIVE SEGMENTS

5-5 STRATIFIED PROPORTION ESTIMATION USING PROPORTIONAL
ALLOCATION RESULTS FOR TWENTY-ONE SEGMENTS

5-6 STRATIFIED PROPORTION ESTIMATION USING SEQUENTIAL
ALLOCATION RESULTS FOR FIVE SEGMENTS, THREE-PIXEL
PER CLUSTER INITIAL ALLOCATION

5-7 STRATIFIED PROPORTION ESTIMATION USING BAYESIAN
SEQUENTIAL ALLOCATION RESULTS FOR TWENTY-ONE
SEGMENTS, TWO-PIXEL PER CLUSTER INITIAL
ALLOCATION

5-8 POOLED VARIANCES FOR SEQUENTIAL ALLOCATION
TECHNIQUES

5-9 LSD FOR COMPARISON BETWEEN BAYESIAN SEQUENTIAL AND
PROPORTIONAL ALLOCATION TECHNIQUES FOR STRATIFIED
PROPORTION ESTIMATION

5-10 VALUES FOR R(proportional) - R(Bayes sequential)




FIGURES


Figure

3-1 Empirical purity distribution for CLASSY clusters
over 10 segments compared with quadratic prior

3-2 Empirical purity distribution for AMOEBA clusters
over 10 segments compared with quadratic prior

3-3 Empirical purity distribution for ISOCLS clusters
over 10 segments compared with quadratic prior

3-4 Empirical purity distribution for CLASSY clusters
over eight small proportion segments compared with
exponential prior

3-5 Empirical purity distribution for AMOEBA clusters
over eight small proportion segments compared with
exponential prior

3-6 Comparison of quadratic and exponential priors at
the value P = 0.211

5-1 Histogram plots of R for stratified proportion
estimation using proportional allocation

5-2 Histogram plots of the reduction in mean-square
error for stratified proportion estimation using
Bayesian sequential allocation

5-3 Histogram plot of R for Procedure 1 based on
approximately 60 pixels (type 2) per estimate




ACRONYMS 


AA Accuracy Assessment 

JSC Lyndon B. Johnson Space Center 

ISOCLS Iterative Self-Organizing Clustering System 

LACIE Large Area Crop Inventory Experiment 

LSD least significant difference 

NASA National Aeronautics and Space Administration 

Pixel picture element 

PCC percent of correct classification 

R the variance reduction criterion 





1. BACKGROUND AND INTRODUCTION


In performing machine classification of remotely sensed data, clustering has 
typically been used to analyze and determine the inherent data signatures. In 
the proportion estimation system developed during the Large Area Crop Inventory 
Experiment (LACIE) and called Procedure 1, the multispectral land satellite 
(Landsat) data was first clustered to obtain the spectral signatures. These 
signatures were then labeled and used to train a maximum likelihood classifier 
which classified each picture element (pixel) in the image into one of the 
labeled classes. The final step was to evaluate the performance of this clas- 
sifier on an independent labeled data set and to use the estimates of the 
omission and commission errors resulting from this evaluation to correct the 
bias in the classified data. Procedure 1, thus, required two sets of labeled
data. A set of approximately 40 labeled pixels, called type 1 dots, was used
to initiate the clustering and to label the resulting clusters. Another set
of approximately 60 labeled pixels, called type 2 dots, was used to evaluate
the classifier and correct any bias in the overall proportion estimates for
the labeled classes.

Within the past year, different investigations have resulted in several
important conclusions regarding the Procedure 1 system. One study (ref. 1)
concluded that the labeled clusters agreed very closely with the corresponding
classifier results. This seems to imply that the classification is unnecessary.
In a second series of studies (refs. 2 and 3), it was found that the overall
variance of the proportion estimates resulting from Procedure 1 was smaller
only by a factor of about 0.7 (on the average) than the variance of the
proportion estimates resulting from a simple random sample of 60 labeled
pixels. The conclusion was that the machine processing, which comprised
Procedure 1, was relatively inefficient.

The current study was designed as a response to the observed deficiencies in 
Procedure 1. It appeared that the classification step was unnecessary and 
that a more efficient procedure would be to simply cluster the data using a 
completely unsupervised clustering algorithm and then use any labeled pixels 
to either label the resulting clusters directly or to perform a stratified
estimate using the clusters as the strata. Such an approach would have the 
advantage of eliminating the need for the type 1 dots as well as the machine 
classification step. 

Since clustering was to be the primary machine processing step in the new 
procedure, it was important to choose the most efficient clustering algorithm 
available. Three algorithms were ultimately chosen for testing. These algo- 
rithms were: 

a. CLASSY (refs. 4, 5, and 6) - an adaptive maximum likelihood algorithm
developed at the National Aeronautics and Space Administration (NASA),
Lyndon B. Johnson Space Center (JSC)

b. AMOEBA (ref. 7) - an algorithm developed at Texas A&M University,
employing both spectral and spatial information 

c. The Iterative Self-Organizing Clustering System (ISOCLS), (ref. 8) - a 
variant of the ISODATA algorithm of Ball and Hall (ref. 9), and the algo- 
rithm used in Procedure 1 

These algorithms were applied to each of 25 LACIE segments collected during 
the 1976-77 crop year. The details of the clustering algorithms and the meas- 
ures used in evaluating the clustering results are discussed in section 2 of 
this report. 

An equally important part of defining a new proportion estimation procedure 
was the selection of a scheme for obtaining a stratified estimate or a method 
of labeling each cluster. In this regard, three stratified estimation schemes 
and three labeling schemes were considered. The details of these schemes are 
described in section 3. A description of the data set and the experimental 
design is included in section 4. In section 5 is a summary of the primary 
results, and section 6 consists of the conclusions drawn from the observed 
results with appropriate recommendations. 


2. CLUSTERING ALGORITHMS AND EVALUATION CRITERIA


The clustering evaluation portion of the study consisted of running each of 
three different clustering algorithms on each of the 25 LACIE segments selected. 
The clustering algorithms tested were CLASSY, AMOEBA, and ISOCLS. 

CLASSY was run using three complete passes through the data where the data set 
consisted of every other pixel in the image. Clusters smaller than 2 percent 
of the scene were eliminated.

ISOCLS was run with the standard iterative parameter set recommended by Wylie 
and Bean (ref. 10) and known as the MPAD cluster parameter set. The values 
of these parameters are given in table 2-1. The algorithm was started with 
40 randomly selected and unlabeled pixels from each image. 

AMOEBA was run with parameters specified by its developers at Texas A&M Uni- 
versity. The minimum number of clusters was set at five. 

Both CLASSY and AMOEBA were run on data which had been transformed to Kauth 
brightness and greenness coordinates on each pass (ref. 11). This reduced the 
dimensionality of the data by a factor of 2. ISOCLS was run on the full dimen- 
sional data in accordance with the standard practice during LACIE Phase III. 

Each of the algorithms tested produced cluster maps which were subsequently
compared with digitized ground-truth maps. The ground-truth maps were
prepared from ground-truth images having a resolution six times that of Landsat
imagery. The higher resolution ground truth was converted to Landsat
resolution by applying majority rule to each six-subpixel area corresponding to
one Landsat pixel. In the event of ties, the first label to receive the tying
number of subpixels was chosen as the Landsat pixel label.
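
The reduction to Landsat resolution can be sketched in a few lines. This is
only an illustration, not the ground-truth processing software: the function
name `majority_label` is hypothetical, and the tie rule is read here as
"scan the six subpixels in order and keep the label that first reaches the
winning count," which is an assumption about the original wording.

```python
from collections import Counter


def majority_label(subpixels):
    """Return the majority label of one block of ground-truth subpixels.

    Assumed tie rule: scanning in order, the label that first reaches the
    final (tying) count wins.
    """
    counts = Counter()
    best_label, best_count = None, 0
    for label in subpixels:
        counts[label] += 1
        # A strict '>' keeps the label that reached this count first.
        if counts[label] > best_count:
            best_label, best_count = label, counts[label]
    return best_label


# Clear majority (4 of 6), then a 3-3 tie in which "wheat" reaches 3 first.
clear = majority_label(["wheat", "other", "wheat", "wheat", "other", "wheat"])
tied = majority_label(["wheat", "other", "wheat", "other", "wheat", "other"])
```

Under this reading, both calls return "wheat".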

TABLE 2-1.- MPAD CLUSTER PARAMETER SET

                      Number of channels
Parameter        8                12               16

CLUSTERS         60.0             60.0             60.0
THRESHOLD        8191             8191             8191
SEP              1                1                1
PERCENT          100              90               90
STDMAX           3.6              3.6              3.6
DLMIN            3.9              4.1              4.5
NMIN             50               50               50
ISTOP            8                8                8
SEQUEN           Split-combine    Split-combine    Split-combine
DOTFIL           (a)              (a)              (a)

(a) Randomly selected starting dots.


By comparing the digitized ground truth with a cluster image, the proportion
of each ground-truth class making up each cluster was determined. The
proportions for the small-grains classes were then combined to give the
proportion of small grains ($P_i$) in each cluster. These data were used to
calculate two different evaluation criteria for each clustered image. These
criteria are called the variance reduction criterion (R) and the percent of
correct classification (PCC), using majority rule labeling.


The R criterion represents the ratio of the variance of a proportion estimate
based on a stratified random sample allocation (in which strata are the
clusters) to the variance of a simple random sample proportion estimate. The
equation for this ratio (when samples that are allocated to clusters are
proportional to the size of the cluster) follows:

$$ R = \frac{\sum_{i=1}^{c} \frac{N_i}{N_T}\, P_i (1 - P_i)}{P(1 - P)} \tag{1} $$

where

$c$ = total number of clusters
$N_i$ = total number of pixels in cluster i
$N_T$ = total number of pixels in the segment
$P_i$ = the proportion of small grains in cluster i
$P$ = the overall proportion of small grains in the segment.

The parameters $P_i$ and $P$ were evaluated using the Accuracy Assessment (AA)
digitized ground-truth data for each segment.
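
As a small numerical illustration of the R criterion, the sketch below
evaluates the ratio for a hypothetical three-cluster segment. The function
name and the sample sizes and purities are made up for the example; the only
assumption beyond equation (1) is that the clusters partition the segment, so
that P is the size-weighted mean of the cluster purities.

```python
def variance_reduction(cluster_sizes, cluster_purities):
    """R of equation (1): stratified variance over simple-random-sample variance."""
    n_total = sum(cluster_sizes)
    # Overall proportion P as the size-weighted mean of the cluster purities.
    p = sum(n * p_i for n, p_i in zip(cluster_sizes, cluster_purities)) / n_total
    within = sum((n / n_total) * p_i * (1.0 - p_i)
                 for n, p_i in zip(cluster_sizes, cluster_purities))
    return within / (p * (1.0 - p))


# Hypothetical segment: three clusters of 500, 300, and 200 pixels.
r_example = variance_reduction([500, 300, 200], [0.9, 0.5, 0.1])
```

Perfectly pure clusters give R = 0, and clusters that all share the segment
proportion give R = 1, so smaller values indicate more useful strata.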


The PCC criterion measures the proportion of pixels that would be correctly
labeled or classified if each cluster were labeled by majority rule. The
equation for computing the PCC criterion may be written as follows:

$$ PCC = \sum_{P_i > 0.5} \frac{N_i}{N_T}\, P_i \;+\; \sum_{P_i < 0.5} \frac{N_i}{N_T}\, (1 - P_i) \tag{2} $$

where $P_i$, $N_i$, and $N_T$ are defined above. The first term represents the
summation over all clusters having $P_i$ > 0.5. These clusters would be labeled
"small grains" by majority rule. The second term represents the summation over
all clusters having $P_i$ < 0.5. These clusters would be labeled "other" by
majority rule.
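
The PCC computation can be sketched the same way. The function name and the
example numbers are hypothetical. One assumption is made for the boundary
case: a cluster with purity exactly 0.5 is counted as half correct, which is
what either label would give.

```python
def pcc(cluster_sizes, cluster_purities):
    """PCC of equation (2) as a fraction between 0.5 and 1."""
    n_total = sum(cluster_sizes)
    correct = 0.0
    for n, p_i in zip(cluster_sizes, cluster_purities):
        # Majority rule labels the cluster by its larger class, so the
        # correctly labeled fraction of the cluster is max(P_i, 1 - P_i).
        correct += (n / n_total) * max(p_i, 1.0 - p_i)
    return correct


# Hypothetical segment: three clusters of 500, 300, and 200 pixels.
pcc_example = pcc([500, 300, 200], [0.9, 0.5, 0.1])
```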

The R criterion serves as a measure of the efficiency of a clustering
algorithm as used in a stratified sampling proportion estimation scheme. The
PCC criterion, on the other hand, serves as an overall indicator of cluster
purity and of the quality of a proportion estimate obtained by labeling
clusters.

The results of evaluating these criteria for each of the three clustering
algorithms as applied to the 25 LACIE segments are given in section 5.



3. TECHNIQUES FOR CLUSTER-BASED PROPORTION ESTIMATION 


The objective of performing clustering in the context of Procedure 1
replacement is to use the results of the clustering as a basis for obtaining a
proportion estimate for a crop of interest. In this study, six different
techniques for obtaining proportion estimates by labeling a subset of pixels
from the image were explored. Three of these techniques result in a labeling of
each cluster, whereas the other three produce estimates of the proportion of
the crop of interest in each cluster. We will refer to the first three
techniques as cluster-labeling techniques and the last three as stratified
proportion estimation techniques.

The various cluster-labeling techniques differ from one another in the manner
in which the subset of pixels to be labeled is selected. In one technique,
pixels are allocated to each cluster proportionally to the size of that
cluster; that is, if $n_T$ total pixels are to be labeled, then

$$ n_i = \frac{N_i}{N_T}\, n_T \tag{3} $$

is the number of pixels to be labeled from each cluster. It should be noted
that if $n_i$ is not an integer, it is rounded up or down. If this produces a
total number of pixels less than $n_T$, the remaining pixels are selected first
from the largest cluster, then the next largest, continuing in this manner.
Clusters too small to receive a single pixel are lumped together, and an
allocation is made to that lumped group. Following the pixel allocation,
majority rule may be applied to label the cluster; that is, if

$$ \hat{P}_i = \frac{x_i}{n_i} \tag{4} $$

where $x_i$ = the number of pixels out of the $n_i$ pixels labeled in cluster i
that are the crop of interest, then the labeling rule is as follows:

a. Label cluster i as the crop of interest if

$$ \hat{P}_i \ge \frac{1}{2} $$

b. Otherwise, label cluster i as being other than the crop of interest.

The proportion estimate is obtained as

$$ \hat{P} = \sum_{\hat{P}_i \ge 1/2} \frac{N_i}{N_T} \tag{5} $$

The procedure just described will be called cluster labeling by proportional
allocation.
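
A minimal sketch of cluster labeling by proportional allocation follows. The
function names are hypothetical. Two simplifications are assumed: fractional
shares are rounded down before the leftover pixels are handed to the largest
clusters, and the lumping of clusters too small to receive a pixel is not
modeled (such clusters simply contribute nothing to the estimate).

```python
def proportional_allocation(cluster_sizes, n_total_labels):
    """Allocate labeled pixels to clusters in proportion to cluster size."""
    n_pixels = sum(cluster_sizes)
    # Integer part of (N_i / N_T) * n_T for every cluster.
    alloc = [(size * n_total_labels) // n_pixels for size in cluster_sizes]
    leftover = n_total_labels - sum(alloc)
    # Leftover pixels go one at a time to the largest clusters first.
    by_size = sorted(range(len(cluster_sizes)),
                     key=lambda i: cluster_sizes[i], reverse=True)
    for i in by_size[:leftover]:
        alloc[i] += 1
    return alloc


def labeled_cluster_estimate(cluster_sizes, crop_counts, alloc):
    """Sum the size fractions of clusters labeled as crop by majority rule."""
    n_pixels = sum(cluster_sizes)
    return sum(size / n_pixels
               for size, x, n in zip(cluster_sizes, crop_counts, alloc)
               if n > 0 and x / n >= 0.5)


sizes = [520, 310, 170]                     # hypothetical cluster sizes
alloc = proportional_allocation(sizes, 10)  # [6, 3, 1]
estimate = labeled_cluster_estimate(sizes, [5, 1, 0], alloc)
```

With the hypothetical counts above, only the first cluster is labeled as the
crop of interest, so the estimate equals its size fraction, 0.52.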


The other two cluster-labeling procedures tested were developed by H. D. Pore
of Lockheed Electronics Company, Inc. (ref. 12). One approach, called cluster
labeling by sequential allocation, labels pixels, selected at random, from a
given cluster until a confidence interval for the estimated proportion of the
crop of interest no longer contains one-half.

The final cluster-labeling approach tested is called cluster labeling by
sequential Bayesian allocation. In this approach a Bayesian estimate for $p_i$,
the probability that the true proportion of the crop of interest is less than
or equal to one-half, is developed. The formal equation is

$$ p_i = \int_0^{1/2} f(\theta_i \mid x_i)\, d\theta_i
       = \frac{\int_0^{1/2} f(x_i \mid \theta_i)\, g(\theta_i)\, d\theta_i}
              {\int_0^{1} f(x_i \mid \theta_i)\, g(\theta_i)\, d\theta_i} \tag{6} $$

where $\theta_i$ = the true proportion of the crop of interest in cluster i,
$g(\theta_i)$ = the unknown prior distribution for the $\theta_i$'s, and, as
before, $x_i$ = the number of pixels out of the $n_i$ pixels labeled in
cluster i that are the crop of interest.

The strategy is to select a form for $g(\theta_i)$ and calculate the form of
$p_i$. Then one may continue sampling at random and labeling the samples
selected until $p_i$ is smaller or larger than a fixed threshold. If $p_i$ is
smaller than $\alpha$, then label cluster i as the crop of interest. If $p_i$ is
greater than $1 - \alpha$, then label the cluster as other than the crop of
interest. Thus, in both cluster labeling by sequential allocation and cluster
labeling by Bayesian sequential allocation, labeling from a given cluster
continues until a specified confidence on the label of that cluster is
obtained. The Bayesian scheme uses the additional information of an estimated
prior distribution on the true cluster purities produced by a given algorithm.
The necessary labeling rules and equations for these two techniques are
developed in ref. 12 and repeated here.


For cluster labeling by sequential allocation, the labeling rule is as follows: 
a. Continue labeling if 


'•= 349 ,. ^* 1 . 5343 ,) 


where 


J 


ni(n, - 1) 


or until 35 samples have been allocated, 
b. Otherwise, label by majority rule 

This interval provides an approximate confidence of 1 - 1/8 = 0.875 in the 
label for each cluster. 


For cluster labeling by sequential Bayesian allocation, the labeling rule is
as follows:

a. Label two pixels from a given cluster. If $x_i$ = 0 or 2, stop and label by
majority rule. Otherwise, go to step b.

b. Label three more pixels. If $x_i$ = 1 or 4, stop and label by majority rule.
Otherwise, go to step c.

c. Label two more pixels. If $x_i$ = 2 or 5, stop and label by majority rule.
Otherwise, go to step d.

d. Label three more pixels. If $x_i$ = 3 or 7, stop and label by majority rule.
Otherwise, go to step e.

e. Label three more pixels and label the cluster by majority rule.

This labeling rule is derived using a uniform prior for $g(\theta)$ and also
provides an approximate probability of correct labeling of 1 - 1/8 = 0.875.
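
Steps a through e amount to a small stopping table, which the following sketch
encodes. The names `STAGES`, `FINAL_N`, and `bayes_sequential_label` are
hypothetical, the analyst's labels are represented as a stream of booleans,
and majority rule is assumed to mean a labeled crop fraction of at least
one-half.

```python
# Cumulative sample sizes and the crop counts that stop the labeling, taken
# from steps a through d; step e (13 pixels in total) always stops.
STAGES = [(2, (0, 2)), (5, (1, 4)), (7, (2, 5)), (10, (3, 7))]
FINAL_N = 13


def bayes_sequential_label(labels):
    """Label one cluster from an iterable of booleans (True = crop of interest).

    Returns (is_crop, pixels_labeled).
    """
    stream = iter(labels)
    n = x = 0
    for stage_n, (low, high) in STAGES:
        while n < stage_n:
            x += bool(next(stream))
            n += 1
        if x == low or x == high:
            return (2 * x >= n, n)
    while n < FINAL_N:
        x += bool(next(stream))
        n += 1
    return (2 * x >= n, n)


pure = bayes_sequential_label([True] * 13)                       # stops at step a
mostly_other = bayes_sequential_label([True] + [False] * 12)     # stops at step b
```

A pure cluster is therefore labeled after only two pixels, while a thoroughly
mixed cluster can require all 13.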

The three techniques for stratified proportion estimation parallel the three
cluster-labeling techniques just discussed. One possibility is to allocate
a total of $n_T$ pixels such that each cluster receives an allocation
proportional to its size. This proportional allocation is accomplished as
described earlier in this section. The proportion estimate is then computed as

$$ \hat{P} = \sum_{i=1}^{c} \frac{N_i}{N_T}\, \frac{x_i}{n_i} \tag{7} $$

The term $x_i / n_i$ represents an estimate of the proportion of cluster i
which is the crop of interest. The remaining two techniques for stratified
proportion estimation differ in the rules used for allocating pixels to
clusters and in the equation used for obtaining the final estimate. As was the
case for cluster labeling, both techniques are sequential in nature with one
employing a Bayesian prior distribution. Both techniques were developed by
H. D. Pore (ref. 13).

The concept of sequential sampling as it is used in these two techniques is
to apply information obtained from previously allocated samples in determining
which cluster should receive the new sample. Suppose $n_i$ pixels have been
allocated to cluster i, and $x_i$ of these pixels are of the crop of interest.
Then

$$ \hat{\sigma}_n^2 = \sum_{i=1}^{c} \left(\frac{N_i}{N_T}\right)^2 \frac{\hat{P}_i (1 - \hat{P}_i)}{n_i - 1} \tag{8} $$

where

$$ \hat{P}_i = \frac{x_i}{n_i} $$

is an estimate of the variance of the usual stratified proportion estimator
as given in equation (7). Now the estimated expected value of
$\hat{\sigma}_{n+1}^2$ is (if one more sample from the ith cluster is taken)

$$ E\left[\hat{\sigma}_{n+1}^2\right] = \frac{x_i}{n_i}\, \hat{\sigma}_{n+1}^2(x_i + 1) + \left(1 - \frac{x_i}{n_i}\right) \hat{\sigma}_{n+1}^2(x_i) \tag{9} $$

where $\hat{\sigma}_{n+1}^2(x_i + 1)$ is the variance based on n + 1 total
samples if the last sample selected is from cluster i and is also the crop of
interest, and $\hat{\sigma}_{n+1}^2(x_i)$ is the variance if the last sample
selected is from cluster i and is other than the crop of interest.

The expected change in the estimated segment proportion variance due to an
additional labeled sample from cluster i is then

$$ \Delta\sigma_i^2 = \hat{\sigma}_n^2 - E\left[\hat{\sigma}_{n+1}^2\right] \tag{10} $$

Written in terms of the basic variables, this equation becomes

$$ \Delta\sigma_i^2 = \left(\frac{N_i}{N_T}\right)^2 \frac{n_i + 3}{(n_i - 1)\, n_i^2\, (n_i + 1)^2}\; x_i (n_i - x_i) \tag{11} $$
The strategy for the first technique, which we shall call stratified
proportion estimation using sequential allocation, is to first allocate at
random a fixed number of pixels to each cluster for the purpose of obtaining an
initial estimate of the proportion of each cluster which is the crop of
interest. Then $\Delta\sigma_i^2$ is computed for each cluster, and the next
sample to be labeled is allocated to the cluster with the largest value of
$\Delta\sigma_i^2$. This process continues until a fixed number of pixels have
been labeled. The proportion estimate is then

$$ \hat{P} = \sum_{i=1}^{c} \frac{N_i}{N_T}\, \frac{x_i}{n_i} \tag{12} $$
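
The allocation loop just described can be sketched as a small simulation. This
is an illustration under stated assumptions rather than the procedure as it
was run: each cluster is represented by a list of ground-truth booleans, the
function names are hypothetical, the initial allocation of three pixels per
cluster is a parameter, ties go to the first cluster in the list, and the
final estimate is taken to be the size-weighted sum of the labeled fractions.

```python
import random


def delta_sigma2(weight, n, x):
    """Equation (11) for one cluster; weight is the size fraction N_i / N_T."""
    return weight ** 2 * (n + 3) * x * (n - x) / ((n - 1) * n ** 2 * (n + 1) ** 2)


def sequential_estimate(clusters, total_labels, initial=3, seed=0):
    """clusters: list of lists of booleans (True = crop of interest)."""
    rng = random.Random(seed)
    n_pixels = sum(len(c) for c in clusters)
    weights = [len(c) / n_pixels for c in clusters]
    pools = [rng.sample(c, len(c)) for c in clusters]  # random labeling order
    n = [initial] * len(clusters)
    x = [sum(pool[:initial]) for pool in pools]
    while sum(n) < total_labels:
        # Clusters that still have unlabeled pixels.
        available = [i for i in range(len(clusters)) if n[i] < len(pools[i])]
        if not available:
            break
        i = max(available, key=lambda j: delta_sigma2(weights[j], n[j], x[j]))
        x[i] += pools[i][n[i]]
        n[i] += 1
    estimate = sum(w * xi / ni for w, xi, ni in zip(weights, x, n))
    return estimate, n


# Hypothetical segment: a fairly pure, a pure, and a mixed cluster.
clusters = [[True] * 90 + [False] * 10, [False] * 100, [True] * 50 + [False] * 50]
estimate, counts = sequential_estimate(clusters, total_labels=30)
```

In this example the pure cluster never receives more than its initial three
pixels, because its expected variance reduction is always zero.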

The last technique, which is called stratified proportion estimation using
Bayesian sequential allocation, is similar to the technique just described
except that the additional information of a prior distribution on cluster
purities is used. In this case we use the posterior Bayes estimate

$$ \hat{\theta}_i = E(\theta_i \mid x_i)
   = \frac{\int_0^1 \theta_i\, f(x_i \mid \theta_i)\, g(\theta_i)\, d\theta_i}
          {\int_0^1 f(x_i \mid \theta_i)\, g(\theta_i)\, d\theta_i} \tag{13} $$

in place of the minimum variance unbiased estimator

$$ \hat{P}_i = \frac{x_i}{n_i} $$

Although $\hat{\theta}_i$ is not unbiased, it is the minimum mean-square-error
estimator. Following an initial fixed allocation to each cluster, one may then
use $\hat{\theta}_i$ in place of $\hat{P}_i$ in equations (8) and (9) to
calculate $\Delta\sigma_i^2$ for each cluster and proceed to allocate
sequentially as before. The only difficulty is in the selection of a prior
distribution on cluster purities.


The prior distribution on cluster purities was chosen following an examination 
of the empirical distribution for each of the three clustering algorithms 
on a subset of 10 segments. These histograms representing percentage of clus- 
ters versus ground-truth percentage of small grains are given in figures 3-1, 
3-2, and 3-3. The similarity of these histograms and their general shape led 
to the belief that at least for segments having a moderate to large amount of 
small grains, a prior distribution which was quadratic in form would be 
appropriate. 

Figure 3-1.- Empirical purity distribution for CLASSY clusters over 10 segments compared with
quadratic prior.

Figure 3-2.- Empirical purity distribution for AMOEBA clusters over 10 segments compared with
quadratic prior.

Figure 3-3.- Empirical purity distribution for ISOCLS clusters over 10 segments compared with
quadratic prior.


It seemed reasonable that the prior distribution, $g(\theta)$, satisfy the
following criteria:

$$ g(\theta) \ge 0 \quad \text{for all } 0 \le \theta \le 1 $$

$$ \int_0^1 g(\theta)\, d\theta = 1 \tag{14} $$

and

$$ \int_0^1 \theta\, g(\theta)\, d\theta = P $$

where

$$ P = \sum_{i=1}^{c} \frac{N_i}{N_T}\, \frac{x_i}{n_i} $$

and is computed following the fixed allocation of pixels to clusters.


These three conditions allow the specification of the three coefficients in
the equation

$$ g(\theta) = a\theta^2 + b\theta + c $$

These coefficients are

$$ a = 6, \qquad b = 12(P - 1), \qquad c = 5 - 6P \qquad \text{for } 0.211 < P < 0.789 \tag{15} $$

It should be noted that the b and c coefficients are only appropriate for a
specified range of P values. If P is not in this range, then $g(\theta)$ will
be negative at some point.
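
The properties claimed for the quadratic prior are easy to verify
numerically. The sketch below (function names hypothetical) computes the
coefficients of equation (15), the exact integral and mean of the quadratic
over the unit interval, and its minimum value, which is the quantity that
turns negative when P leaves the stated range.

```python
def quadratic_prior_coefficients(p):
    """Coefficients of g(theta) = a*theta**2 + b*theta + c from equation (15)."""
    return 6.0, 12.0 * (p - 1.0), 5.0 - 6.0 * p


def prior_moments(a, b, c):
    """Exact integral and mean of the quadratic on [0, 1]."""
    return a / 3 + b / 2 + c, a / 4 + b / 3 + c / 2


def prior_minimum(a, b, c):
    """Smallest value of the quadratic on [0, 1] (a > 0 assumed)."""
    vertex = min(max(-b / (2 * a), 0.0), 1.0)
    return a * vertex ** 2 + b * vertex + c


inside = quadratic_prior_coefficients(0.34)   # P within the valid range
outside = quadratic_prior_coefficients(0.10)  # P below 0.211
total, mean = prior_moments(*inside)
```

The minimum of the quadratic works out to -6P^2 + 6P - 1, which is zero at
P = 0.211 and P = 0.789 (to three decimals) and negative outside that
interval.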


The fact that a quadratic prior is only appropriate over a limited range of 
P values also seemed to be validated by empirical evidence. Figures 3-4 and 
3-5 show histograms of cluster purity for eight segments which had low ground- 
truth proportions of small grains. Clearly a quadratic prior is not appro- 
priate. On this basis, it was decided to select an alternate prior for seg- 
ments which had a small portion of the crop of interest. The prior for 
segments with a very large proportion of the crop of interest might reasonably 
be thought to be like a "flipped" version of the prior for small proportion 
segments. 

Figure 3-4.- Empirical purity distribution for CLASSY clusters over eight
small proportion segments compared with exponential prior.

Figure 3-5.- Empirical purity distribution for AMOEBA clusters over eight
small proportion segments compared with exponential prior.

It was decided that the form of the prior for small proportion segments would be

$$ g(\theta) = \beta\theta^{-\alpha} - \beta = \beta\left(\theta^{-\alpha} - 1\right) \tag{16} $$

and that this distribution should satisfy the following constraints:

$$ g(\theta) \ge 0 \quad \text{for all } 0 < \theta \le 1 $$

$$ \int_0^1 g(\theta)\, d\theta = 1 $$

$$ g(1) = 0 $$

$$ \int_0^1 \theta\, g(\theta)\, d\theta = P \tag{17} $$

These constraints may be used to determine the parameters $\alpha$ and $\beta$,
which are

$$ \alpha = \frac{1 - 4P}{1 - 2P}, \qquad \beta = \frac{1 - \alpha}{\alpha} \qquad \text{for } 0 < P < 0.25 \tag{18} $$

This prior will be called the exponential prior. In order to see how well the
quadratic and exponential priors fit the empirical cluster purity histograms,
the following calculations were made:

a. The average ground-truth proportion of small grains in the 10 segments used
to obtain the data reflected in figures 3-1, 3-2, and 3-3 was computed.

b. The average ground-truth proportion of small grains in the eight segments
used to obtain the data reflected in figures 3-4 and 3-5 was computed.

The first proportion, call it $P_Q$, was then used to calculate the
coefficients a, b, and c [equation (15)] specifying a quadratic prior. This
prior is plotted in figures 3-1, 3-2, and 3-3 as a smooth curve for comparison
with the empirical histograms. Similarly, the average ground-truth proportion
for the eight small proportion segments, call it $P_E$, was used to calculate
the coefficients $\alpha$ and $\beta$ for an exponential prior. This prior is
plotted as a smooth curve on figures 3-4 and 3-5. It is evident from examining
figures 3-1 through 3-5 that both prior distributions seem to fit the
empirical cluster purity distributions well.
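
The parameters of the exponential prior can be checked against the
normalization and mean constraints in closed form. In the sketch below the
function names are hypothetical; the two moment formulas follow from
integrating theta**(-alpha) - 1 over the unit interval, which converges for
alpha < 1.

```python
def exponential_prior_parameters(p):
    """alpha and beta of g(theta) = beta * (theta**(-alpha) - 1), equation (18)."""
    if not 0.0 < p < 0.25:
        raise ValueError("equation (18) applies only for 0 < P < 0.25")
    alpha = (1.0 - 4.0 * p) / (1.0 - 2.0 * p)
    beta = (1.0 - alpha) / alpha
    return alpha, beta


def exponential_prior_moments(alpha, beta):
    """Closed-form integral and mean of the prior on (0, 1], valid for alpha < 1."""
    total = beta * (1.0 / (1.0 - alpha) - 1.0)
    mean = beta * (1.0 / (2.0 - alpha) - 0.5)
    return total, mean


alpha, beta = exponential_prior_parameters(0.10)  # alpha = 0.75, beta = 1/3
total, mean = exponential_prior_moments(alpha, beta)
```

For P = 0.10 the prior integrates to 1 and has mean 0.10, as the constraints
require.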

In actual practice, both the sequential and the Bayesian sequential procedures
were initiated with random allocation of two pixels per cluster. Following
this allocation, the Bayesian sequential procedure computes two different
estimates of the segment proportion. One is given by

$$ \hat{P} = \sum_{i=1}^{c} \frac{N_i}{N_T}\, \frac{x_i}{n_i} \tag{19} $$

whereas the other is the Bayes posterior estimate based on a quadratic prior
and an average proportion estimate of P = 0.34. The equation for this estimate
is

$$ \tilde{P} = \sum_{i=1}^{c} \frac{N_i}{N_T}\, \theta(n_i, x_i) \tag{20} $$

where

$$ \theta(n_i, x_i) = \frac{a[(x_i + 1)(x_i + 2)(x_i + 3)] + b[(x_i + 1)(x_i + 2)(n_i + 4)] + c[(x_i + 1)(n_i + 3)(n_i + 4)]}
                          {a[(x_i + 1)(x_i + 2)(n_i + 4)] + b[(x_i + 1)(n_i + 3)(n_i + 4)] + c[(n_i + 2)(n_i + 3)(n_i + 4)]} \tag{21} $$
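
Equation (21) is the posterior mean of the cluster purity for a binomial
likelihood and the quadratic prior. The sketch below (function name
hypothetical) evaluates it and illustrates two sanity checks: with no labeled
pixels the estimate reduces to the prior mean, and with a uniform prior
(a = b = 0, c = 1) it reduces to (x + 1)/(n + 2).

```python
def theta_quadratic(n, x, a, b, c):
    """Equation (21): posterior mean of cluster purity under a quadratic prior."""
    num = (a * (x + 1) * (x + 2) * (x + 3)
           + b * (x + 1) * (x + 2) * (n + 4)
           + c * (x + 1) * (n + 3) * (n + 4))
    den = (a * (x + 1) * (x + 2) * (n + 4)
           + b * (x + 1) * (n + 3) * (n + 4)
           + c * (n + 2) * (n + 3) * (n + 4))
    return num / den


# Coefficients of equation (15) for the average proportion P = 0.34.
p = 0.34
a, b, c = 6.0, 12.0 * (p - 1.0), 5.0 - 6.0 * p

prior_mean = theta_quadratic(0, 0, a, b, c)       # no data: equals P
uniform_case = theta_quadratic(2, 1, 0.0, 0.0, 1.0)  # (x + 1)/(n + 2) = 0.5
```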


If 0.211 < P, then the quadratic prior is selected and 0 is used to reset the 
parameters a, b, and c. Sequential selection then proceeds with 


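$$ \Delta\sigma_i^2 = \left(\frac{N_i}{N_T}\right)^2 \left\{
   \frac{\theta(n_i, x_i)\,[1 - \theta(n_i, x_i)]}{n_i - 1}
   - \frac{\theta(n_i, x_i)\,\theta(n_i + 1, x_i + 1)\,[1 - \theta(n_i + 1, x_i + 1)]}{n_i}
   - \frac{[1 - \theta(n_i, x_i)]\,\theta(n_i + 1, x_i)\,[1 - \theta(n_i + 1, x_i)]}{n_i}
   \right\} \tag{22} $$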


After a number of dots have been allocated, an overall proportion estimate is
obtained via equation (20), using the current values of the $\theta(n_i, x_i)$
estimates. If 0.211 > P, then the exponential prior is used to calculate the
parameters $\alpha$ and $\beta$. Sequential selection then proceeds with
$\Delta\sigma_i^2$ given by equation (22), using

$$ \theta(n_i, x_i) = \frac{\dfrac{x_i + 1 - \alpha}{n_i + 2 - \alpha}\,\gamma_1 - \dfrac{x_i + 1}{n_i + 2}\,\gamma_2}{\gamma_1 - \gamma_2} \tag{23} $$

where

$$ \gamma_1 = (n_i + 1)(n_i)(n_i - 1) \cdots (x_i + 1) $$

$$ \gamma_2 = (n_i + 1 - \alpha)(n_i - \alpha) \cdots (x_i + 1 - \alpha) $$

After a number of dots have been allocated, an overall proportion estimate is
obtained as before using equation (20).
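
The posterior mean under the exponential prior can be sketched in terms of the
two products defined with equation (23). The function name is hypothetical,
and the numerator used here is an assumption: it is derived in this note from
the prior of equation (16) and a binomial likelihood, and is not copied from
the original equation. With no labeled pixels it reduces to the prior mean P.

```python
import math


def theta_exponential(n, x, alpha):
    """Posterior mean of cluster purity under g(theta) = beta*(theta**(-alpha) - 1).

    gamma1 and gamma2 are the products defined with equation (23); the
    numerator is derived from the prior and is an assumption of this sketch.
    """
    gamma1 = math.prod(range(x + 1, n + 2))                  # (n+1)(n)...(x+1)
    gamma2 = math.prod(k - alpha for k in range(x + 1, n + 2))
    num = ((x + 1 - alpha) / (n + 2 - alpha) * gamma1
           - (x + 1) / (n + 2) * gamma2)
    return num / (gamma1 - gamma2)


# alpha = 0.75 corresponds to P = 0.10 in equation (18).
prior_mean = theta_exponential(0, 0, 0.75)
after_five = theta_exponential(5, 2, 0.75)
```

The scale parameter beta cancels between numerator and denominator, so only
alpha enters the estimate.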

Figure 3-6 shows a comparison of the quadratic and exponential priors at the 
value P = 0.211, where the switch occurs from one to the other. The curves
are close enough for this value of P that the decision as to which one to use 
is not critical. 


Outlined in this section are six different techniques for cluster-based
proportion estimation. As a way of summarizing these developments, a brief
discussion of some of the expected characteristics of these techniques follows.

Three cluster-labeling and three stratified proportion-estimation schemes have
been considered. If the clusters are very pure, then cluster labeling should
produce proportion estimates with small bias and very small variance. In
addition, relatively few labeled pixels should be required to obtain these
estimates, and the estimates themselves should not be very sensitive to
occasional labeling errors. Cluster labeling using sequential allocation or
Bayesian sequential allocation provides a specified confidence in the labels of
clusters. These techniques should require fewer dots to be labeled on the
average than does cluster labeling using proportional allocation.

Figure 3-6.- Comparison of quadratic and exponential priors at the value of P = 0.211.





If the clusters are significantly mixed, all of the cluster-labeling schemes
will suffer. In this case, a more appropriate technique is provided by
stratified proportion estimation. Stratified proportion estimation using
proportional allocation provides theoretically unbiased estimates. The
stratified proportion estimation techniques using sequential and Bayesian
sequential allocation are not theoretically unbiased but should produce
estimates with a lower mean-square error for a given number of dots allocated
than the proportional allocation approach. Both of the sequential techniques
incorporate information about both the size and the estimated purity of
clusters in performing the dot allocation.


4. DATA SET AND EXPERIMENTAL DESIGN


The data set for this study consisted of 25 LACIE segments selected at random 
from the Phase III (1976-1977) blind site data base. Eighteen of the segments 
are the same as those used in the secondary error analysis study (refs. 2 
and 3). Seven substitutions in the secondary error analysis data set were 
necessary because the original segments were not well registered to the digi- 
tized ground truth. The segments selected represent a cross section of the 
U.S. Great Plains. Both winter- and spring-wheat segments were included. 

Three segments in the data set were discovered to have significant amounts of 
strip fallow small grains where the strips were not resolved in the ground 
truth. These segments, 1648, 1739, and 1544, were clustered but were not eval- 
uated using the proportion-estimation schemes because reliable labels were 
not available for the strip fallow area. One other segment, 1079, was not 
evaluated using the proportion-estimation schemes because it was found to con- 
tain 27 percent abandoned winter wheat and was, thus, a very atypical segment. 
Table 4-1 lists the 21 segments actually used in the testing,
their location, the acquisitions used, and the proportion of small grains from
the digitized ground truth.

In the experimental design, each of the six proportion-estimation
techniques was first evaluated on a subset of five segments selected from
the set of 21 acceptable segments. The subset that was
selected consisted of segments 1005, 1853, 1520, 1231, and 1060. After eval-
uating these preliminary results, the most promising techniques were selected
and run on the remainder of the 21 segments.

Each combination of proportion-estimation technique and clustering algorithm was
repeated 100 times for each segment. Each repetition used a different pseudo-
random sequence in selecting pixels. Thus, it was possible to calculate the
average bias in the proportion estimate, the mean-square error of the esti-
mate, and the R factor as compared to simple random sampling. These results
are reported in the appendix. Averages and variances of these results over
segments were also calculated. These results appear in section 5.
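
The three summary measures can be sketched in Python as follows. This is an illustration under stated assumptions, not the code used in the study: the function name is hypothetical, and R is taken here to be the mean-square error divided by the binomial variance p(1 - p)/n of a simple random sample of the same size, with no finite-population correction.

```python
import random


def summarize_repetitions(estimates, true_proportion, n_pixels):
    """Return (average bias, mean-square error, R) over repeated estimates.

    R is assumed to be MSE / [p(1 - p) / n], the simple-random-sampling variance.
    """
    n_rep = len(estimates)
    bias = sum(estimates) / n_rep - true_proportion
    mse = sum((e - true_proportion) ** 2 for e in estimates) / n_rep
    srs_variance = true_proportion * (1.0 - true_proportion) / n_pixels
    return bias, mse, mse / srs_variance


# Usage: 100 repetitions of plain random sampling, each with its own
# pseudo-random draws. For this baseline, R should come out near 1.
rng = random.Random(0)
p_true, n_dots = 0.348, 60  # ground-truth proportion of segment 1005; 60 dots
estimates = [sum(rng.random() < p_true for _ in range(n_dots)) / n_dots
             for _ in range(100)]
bias, mse, r_factor = summarize_repetitions(estimates, p_true, n_dots)
print(f"bias={bias:.4f}  mse={mse:.5f}  R={r_factor:.2f}")
```

An R below 1 then indicates that a technique outperforms simple random sampling for the same number of labeled pixels.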
TABLE 4-1.- DESCRIPTION OF THE TWENTY-ONE SEGMENTS USED IN THE STUDY

                                                                   Ground-truth
                                                                   proportion of
Segment     Location                   Acquisitions used           small grains

1005 (W)    Cheyenne, Colorado         7177, 7159, 6326, 6254      0.348
1032 (W)    Wichita, Kansas            7194, 7086, 6326, 6254       .371
1033 (W)    Clark, Kansas              7156, 6288                   .095
1853 (W)    Ness, Kansas               7193, 7067, 6253             .306
1166 (W)    Lyon, Kansas               7190, 7154, 7082, 6286       .066
1512 (S)    Clay, Minnesota            7193, 7156                   .340
1520 (S)    Big Stone, Minnesota       7174, 7156, 7120             .301
1577 (W)    Platte, Nebraska           7120, 6306                   .029
1604 (S)    Renville, North Dakota     7143, 7125                   .524
1606 (S)    Ward, North Dakota         7197, 7125                   .330
1661 (S)    McIntosh, North Dakota     7159, 7123                   .414
1899 (S)    Walsh, North Dakota        7193, 7175, 7157, 7122       .596
1231 (W)    Jackson, Oklahoma          7156, 7066, 6288             .744
1239 (W)    Noble, Oklahoma            7155, 7082, 6268             .167
1367 (W)    Major, Oklahoma            7155, 7101, 6287             .606
1675 (S)    McPherson, South Dakota    7230, 7176, 7123, 6254       .291
1686 (S)    Beadle, South Dakota       7194, 7140, 6307, 6254       .194
1803 (W)    Shannon, South Dakota      7178, 7159, 7123, 6255       .032
1805 (M)    Gregory, South Dakota      7211, 7158, 6307, 6290       .164
1059 (W)    Ochiltree, Texas           7157, 7121, 6325, 6307       .437
1060 (W)    Sherman, Texas             7158, 7068                   .231

Symbol definition:

M = Mixed
S = Spring wheat
W = Winter wheat
5. RESULTS


The results of the study are summarized in two parts. The first part pertains
to the evaluation of the clustering algorithms, and the second part is an 
evaluation and comparison of the six techniques for proportion estimation. 

The R, as compared to simple random sampling, and the PCC, using majority rule 
labeling, are given in table 5-1 for each of the three algorithms tested as 
applied to each of the 21 segments. Averages for each measure over segments 
are given at the bottom of the table along with an estimate of the standard 
deviation over segments. None of the averages are significantly different. 

In fact, it is striking how similar the average results are in view of the 
differences in the algorithms. This similarity will be further discussed in 
section 6.

One significant difference is in the number of clusters produced by each algo- 
rithm. At the bottom of table 5-1, the average number of clusters and the 
standard deviation in the number of clusters are indicated. The average number 
of clusters nearly doubles when going from CLASSY to AMOEBA and doubles again 
in going from AMOEBA to ISOCLS. Economy in the number of clusters produced 
is generally considered a distinct advantage for a clustering algorithm. It 
is clearly an advantage in the stratified proportion-estimation techniques. 
Indeed, the sequential stratified techniques require that a fixed number of
pixels (usually 2) be allocated to each cluster initially. Thus, a large 
number of clusters means that a large number of pixels must be allocated 
before sequential allocation even begins. 
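
The cost of the initial allocation can be checked with simple arithmetic. The Python sketch below multiplies the average cluster counts from table 5-1 by the number of pixels initially allocated per cluster; the function name is hypothetical, and using the average cluster count (rather than per-segment counts) is a simplification made for the example.

```python
# Average number of clusters per algorithm, taken from the bottom of table 5-1.
AVERAGE_CLUSTERS = {"CLASSY": 9.32, "AMOEBA": 17.46, "ISOCLS": 36.84}


def initial_allocation(n_clusters, pixels_per_cluster=2):
    """Pixels consumed before any sequential allocation decision is made."""
    return n_clusters * pixels_per_cluster


for name, n_clusters in AVERAGE_CLUSTERS.items():
    two = initial_allocation(n_clusters, 2)
    three = initial_allocation(n_clusters, 3)
    print(f"{name}: {two:.1f} pixels at 2 per cluster, {three:.1f} at 3 per cluster")
```

On average, ISOCLS therefore uses roughly 74 pixels at two pixels per cluster before sequential allocation can start, compared with about 19 for CLASSY.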

Presented in tables 5-2, 5-3, and 5-4 are the results for the three cluster- 
labeling schemes; and in tables 5-5, 5-6, and 5-7 are the results for the 
three stratified proportion-estimation schemes. The results presented in each 
table are averages and variances over the segments processed for each of the
measures recorded, using a given scheme. For each scheme, with the exception 
of stratified proportion estimation using proportional allocation, the meas- 
ures recorded were the average bias, the mean-square error, and the reduction 
TABLE 5-1.- PCC VALUES USING MAJORITY RULE LABELING AND
R VALUES FOR CLASSY, AMOEBA, AND ISOCLS

                         CLASSY               AMOEBA               ISOCLS
Segment               PCC       R          PCC       R          PCC       R

1005 (W)            0.8398   0.5671      0.9132   0.6372      0.8659   0.6571
1032 (W)             .8975    .3450       .8541    .4585       .8367    .4978
1033 (W)             .9050    .8208       .9151    .7363       .9247    .6247
1853 (W)             .89?3    .40?3       .7926    .6966       .8859    .4655
1166 (W)             .9333    .8287       .9388    .7857       .9386    .6994
1512 (S)             .7110    .8269       .7621    .7481       .7576    .7767
1520 (S)             .8361    .5768       .8522    .5213       .8546    .5735
1577 (W)             .9678    .9055       .9678    .9076       .9684    .8814
1604 (S)             .6877    .3419       .7318    .7538       .6749    .7893
1606 (S)             .8229    .6071       .8002    .6511       .7958    .7201
1661 (S)             .7260    .7395       .7523    .6745       .7184    .7767
1899 (S)             .8427    .4852       .8555    .4684       .8426    .5196
1231 (W)             .8773    .4849       .8926    .4450       .8788    .4941
1239 (W)             .8508    .7175       .8702    .6586       .8601    .7322
1367 (W)             .8023    .5654       .8198    .5644       .8051    .6238
1675 (S)             .7929    .7056       .8060    .6243       .7890    .7282
1686 (S)             .8352    .7847       .8485    .6933       .8400    .8128
1803 (W)             .9681    .8313       .9701    .7339       .9733    .6502
1805 (M)             .9052    .5007       .9199    .4680       .9219    .4839
1059 (W)             .8448    .4515       .8667    .4126       .8768    .4062
1060 (W)             .8583    .5984       .8824    .5227       .8757    .6002

Average              .8476    .6472       .8521    .6268       .8488    .6435
Standard deviation   .0754    .1663       .0688    .1333       .0771    .1316

Average number
of clusters,
± 1 standard
deviation              9.32 ± 2.15         17.46 ± 10.15        36.84 ± 2.32
TABLE 5-2.- MAJORITY RULE LABELING USING PROPORTIONAL ALLOCATION RESULTS FOR FIVE SEGMENTS

Number of
pixels
allocated       CLASSY       AMOEBA       ISOCLS          CLASSY        AMOEBA        ISOCLS

                         Average bias                           Variance of bias

   30         -0.009508    -0.015600     0.013634        0.000839      0.001999      0.000202
   60           .001838     -.026056     -.024830         .002620       .000596       .000195
   90          -.071312     -.034964     -.026952         .022647       .000651       .000371
  120          -.016828     -.033568     -.016600         .001955       .000800       .001039

                   Average mean-square error              Variance of mean-square error

   30          0.024594     0.057056     0.011561        0.000188      0.002791      0.000050
   60           .054702      .038171      .029260         .002262       .000619       .000205
   90           .062212      .050078      .029656         .005637       .002679       .000463
  120           .047929      .049945      .033015         .003409       .002345       .001398

                    Average reduction in                    Variance of reduction in
                     mean-square error                         mean-square error

   30          3.585012     8.984081     1.747804        3.608364     83.195801      1.329618
   60         16.227?76    11.904598     8.945074      207.806641     71.670441     25.393509
   90         27.270935    24.033615    13.662670     1017.292236    719.822998    115.909088
  120         27.489548    32.010651    20.962250     1101.703857   1113.502686    631.753662



TABLE 5-3.- MAJORITY RULE LABELING USING SEQUENTIAL ALLOCATION RESULTS FOR
FIVE SEGMENTS, THREE-PIXEL PER CLUSTER INITIAL ALLOCATION

                                                    CLASSY         AMOEBA         ISOCLS

Average bias                                     -0.04449496    -0.03424257    -0.03201438
Variance of bias                                  0.00107109     0.00053136     0.00094198

Average mean-square error                         0.00574680     0.00254860     0.00266640
Variance of mean-square error                     0.00000913     0.00000660     0.00000073

Average reduction in mean-square error            1.67606068     1.24144173     3.41460514
Variance of reduction in mean-square error        0.90543842     1.75853252     1.39696312

Average number of pixels allocated                57.648         75.286        257.475
Variance of number of pixels allocated            68.674       2042.372        308.177









TABLE 5-4.- MAJORITY RULE LABELING USING BAYESIAN SEQUENTIAL ALLOCATION RESULTS FOR
FIVE SEGMENTS, TWO-PIXEL PER CLUSTER INITIAL ALLOCATION

                                                    CLASSY         AMOEBA         ISOCLS

Average bias                                     -0.03277557    -0.02864778    -0.02584878
Variance of bias                                  0.00060669     0.00038843     0.00079368

Average mean-square error                         0.00604460     0.00682659     0.00267940
Variance of mean-square error                     0.00000393     0.00000916     0.00000062

Average reduction in mean-square error            0.91108280     1.38561249     1.65233707
Variance of reduction in mean-square error        0.13923180     0.85401917     0.18573952

Average number of pixels allocated                29.930         43.074        125.996
Variance of number of pixels allocated            23.486        566.810         47.896




TABLE 5-5.- STRATIFIED PROPORTION ESTIMATION USING PROPORTIONAL ALLOCATION
RESULTS FOR TWENTY-ONE SEGMENTS

Number of
pixels
allocated       CLASSY         AMOEBA         ISOCLS            CLASSY         AMOEBA         ISOCLS

                          Average variance                           Variance of variance

   30         0.003852895    0.003591756    0.003565516       0.000004197    0.000002433    0.000002063
   60          .001815951     .001814903     .001715998        .000000648     .000000738     .000000464
   90          .001301855     .001269474     .001444855        .000000391     .000000339     .000000871
  120          .000884570     .000945522     .000986570        .000000143     .000000164     .000000350

                    Average reduction in variance              Variance of reduction in variance

   30         0.687449038    0.627526164    0.636414111       0.053946018    0.019914806    0.025356948
   60          .636317074     .626016080     .629446924        .023804247     .031225204     .042545319
   90          .688710690     .656349719     .694832742        .041802645     .024449527     .042262435
  120          .636751771     .662965417     .624346912        .028034508     .024315834     .023863912




TABLE 5-6.- STRATIFIED PROPORTION ESTIMATION USING SEQUENTIAL ALLOCATION RESULTS FOR
FIVE SEGMENTS, THREE-PIXEL PER CLUSTER INITIAL ALLOCATION

Number of
pixels
allocated       CLASSY        AMOEBA        ISOCLS           CLASSY        AMOEBA        ISOCLS

                          Average bias                             Variance of bias

   30         -0.00088333   -0.00585000    0.0              0.00015393    0.00003784    0.0
   60          -.01415999    02248665       .0               .00036671     .00009266     .0
   90          -.01781999    -.02010199     .0               .00045373     .00013612     .0
  120          -.01948998    -.02173998    -.00385000        .00046703     .00017864     .00007823

                    Average mean-square error                Variance of mean-square error

   30          0.00345100    0.00513500    0.0              0.00000020    0.00000001    0.0
   60           .00296520     .00325900     .0               .00000024     .00000002     .0
   90           .00277940     .00298240     .0               .00000030     .00000090     .0
  120           .00274540     .00276980     .00124575        .00000035     .00000087     .00000015

                     Average reduction in                   Variance of average reduction
                      mean-square error                         in mean-square error

   30          0.54175025    0.72903204    0.0              0.01088542    0.00039721    0.0
   60           .87602842     .98629665     .0               .01731825     .01368725     .0
   90          1.23414421    1.30850601     .0               .05600834     .13552380     .0
  120          1.62500954    1.61916065     .70379806        .11822701     .24639034     .03868544




TABLE 5-7.- STRATIFIED PROPORTION ESTIMATION USING BAYESIAN SEQUENTIAL ALLOCATION
RESULTS FOR TWENTY-ONE SEGMENTS, TWO-PIXEL PER CLUSTER INITIAL ALLOCATION

Number of
pixels
allocated       CLASSY        AMOEBA        ISOCLS           CLASSY        AMOEBA        ISOCLS

                          Average bias                             Variance of bias

   30          0.00036809   -0.00841666    0.0              0.00010890    0.00051509    0.0
   60           .00006095    -.00430625     .0               .00012138     .00013838     .0
   90          -.00037000    -.00495141    -.00323619        .00008227     .00020197     .00007368
  120          -.00040190    -.00451095    -.00324428        .00006833     .00017815     .00007746

                    Average mean-square error                Variance of mean-square error

   30          0.00285286    0.00522211    0.0              0.00000367    0.00000503    0.0
   60           .00148009     .00212906     .0               .00000065     .00000119     .0
   90           .00099690     .00140800     .00099719        .00000030     .00000059     .00000021
  120           .00073538     .00106862     .00075933        .00000015     .00000035     .00000012

                     Average reduction in                     Variance of reduction in
                      mean-square error                          mean-square error

   30          0.48676664    0.76839358    0.0              0.04504710    0.10229522    0.0
   60           .51693314     .72288340     .0               .03661084     .06289172     .0
   90           .52017057     .72251660     .51264614        .03732508     .07170510     .01777804
  120           .51932829     .73885107     .52794492        .03581393     .08057529     .02143240




in mean-square error as compared to simple random sampling. Because stratified
proportion estimation (using proportional allocation) is theoretically unbiased,
the bias was not recorded; the variance and the R, rather than the mean-square
error and reduction in mean-square error, were recorded. The techniques using
sequential allocation for majority-rule labeling did not allocate a fixed num-
ber of pixels, and hence, only the average number of pixels allocated is
reported. The sequential Bayesian technique used an initial allocation of two
pixels per cluster, whereas the sequential technique without a prior used an
initial allocation of three pixels per cluster. The same initial allocations were
used for the Bayesian and "no prior" sequential techniques that were used in strat-
ified proportion estimation. The missing values in tables 5-6 and 5-7 (entered
as 0.0) indicate that in some cases sequential allocation could not begin until
a larger number of dots had been allocated.

After examining the results for the subset of five segments, it was clear that 
all of the cluster-labeling schemes as well as the stratified proportion esti- 
mation using sequential allocation were not competitive with stratified pro- 
portion estimation using either proportional allocation or Bayesian sequential 
allocation. This is most readily apparent in a comparison of the reduction in 
mean-square error or R results. 

The technique using sequential allocation in obtaining stratified proportion 
estimates does look competitive at an allocation of 30 pixels. Because it 
was not significantly better than stratified proportion estimation using 
Bayesian sequential allocation, it was decided to place the most emphasis on 
a comparison of the Bayesian sequential and the proportional allocation tech- 
niques as used in obtaining stratified proportion estimates. Consequently, 
tables 5-5 and 5-7 represent results for the full 21 segments, whereas tables
5-2, 5-3, 5-4, and 5-6 represent the results for five segments.

Figures 5-1 and 5-2 are a presentation in histogram form of the same data 
which are summarized in tables 5-5 and 5-7. Figure 5-3 is a comparative histo- 
gram plot of R values for Procedure 1, which are reported in reference 3. In 
this plot, it is assumed that there is an allocation of pixels equal to the 
Figure 5-2.- Histogram plots of the reduction in mean-square
error for stratified proportion estimation using Bayesian
sequential allocation.

Figure 5-3.- Histogram plot of the R for Procedure 1
based on approximately 60 pixels (type 2) per
estimate.

number of type 2 dots used in each estimate (approximately 60 pixels). The 
complete data for each of the six proportion-estimation techniques studied are 
in the appendix of this report. 

The results in table 5-5 are essentially an empirical verification of the
results in table 5-1. In particular, the R averages may be compared. In
theory, the R (using this technique) should be independent of the number of
dots allocated. Indeed, there are no significant differences among the values
of average R calculated for 30, 60, 90, or 120 dots. In addition, the averages
for each algorithm tend to agree well with the theoretical average R values
appearing in table 5-1.
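
The independence of R from the number of dots follows from the usual textbook variance formulas for proportional allocation. The short derivation below is a sketch under stated assumptions: finite-population corrections are ignored, and the symbols W_i (fraction of the segment's pixels in cluster i), p_i (true small-grains proportion in cluster i), K (number of clusters), and n (total number of dots) are notation introduced here for illustration rather than taken from this report.

```latex
\begin{align*}
\operatorname{Var}(\hat p_{\mathrm{strat}})
  &= \sum_{i=1}^{K} W_i^{2}\,\frac{p_i(1-p_i)}{n_i},
     \qquad n_i = n\,W_i \\
  &= \frac{1}{n}\sum_{i=1}^{K} W_i\,p_i(1-p_i), \\
\operatorname{Var}(\hat p_{\mathrm{SRS}})
  &= \frac{p(1-p)}{n},
     \qquad p = \sum_{i=1}^{K} W_i\,p_i, \\
R &= \frac{\operatorname{Var}(\hat p_{\mathrm{strat}})}
          {\operatorname{Var}(\hat p_{\mathrm{SRS}})}
   = \frac{\sum_{i=1}^{K} W_i\,p_i(1-p_i)}{p(1-p)} .
\end{align*}
```

The factor 1/n cancels in the ratio, so under these assumptions R depends only on the cluster sizes and purities.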

In examining table 5-7, it is clear that the Bayesian sequential allocation 
technique, as used in obtaining stratified proportion estimates, has an ex- 
tremely low bias for all three algorithms even though the procedure itself is 
not theoretically unbiased. None of the average bias results in this table 
for any of the algorithms are significantly different from zero. 
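
One way to check this statement against table 5-7 is a one-sample t statistic for each average bias. The Python sketch below is an assumption about how such a check could be made, not a record of the test actually used: it treats each "variance of bias" entry as a sample variance over 21 segments and uses the two-sided 5-percent critical value of Student's t with 20 degrees of freedom (2.086).

```python
import math

N_SEGMENTS = 21      # assumed number of segments behind every table entry
T_CRIT_20DF = 2.086  # two-sided 5-percent point of Student's t, 20 d.f.

# (average bias, variance of bias) from table 5-7; missing entries omitted.
TABLE_5_7_BIAS = {
    ("CLASSY", 30): (0.00036809, 0.00010890),
    ("CLASSY", 60): (0.00006095, 0.00012138),
    ("CLASSY", 90): (-0.00037000, 0.00008227),
    ("CLASSY", 120): (-0.00040190, 0.00006833),
    ("AMOEBA", 30): (-0.00841666, 0.00051509),
    ("AMOEBA", 60): (-0.00430625, 0.00013838),
    ("AMOEBA", 90): (-0.00495141, 0.00020197),
    ("AMOEBA", 120): (-0.00451095, 0.00017815),
    ("ISOCLS", 90): (-0.00323619, 0.00007368),
    ("ISOCLS", 120): (-0.00324428, 0.00007746),
}


def bias_t_statistic(mean_bias, variance_over_segments, n=N_SEGMENTS):
    """t statistic for the hypothesis that the true average bias is zero."""
    return mean_bias / math.sqrt(variance_over_segments / n)


for (algorithm, pixels), (mean_bias, var_bias) in TABLE_5_7_BIAS.items():
    t = bias_t_statistic(mean_bias, var_bias)
    verdict = "significant" if abs(t) > T_CRIT_20DF else "not significant"
    print(f"{algorithm:7s} {pixels:3d} pixels: t = {t:+.2f} ({verdict})")
```

Under these assumptions every |t| falls below the critical value, which is consistent with the statement above.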

A comparison of the average reduction in mean-square error for the Bayesian
sequential allocation technique (table 5-7) with the average R for the pro-
portional allocation technique (table 5-5) shows that using the Bayesian
sequential approach with the CLASSY algorithm gives results which are consis-
tently lower than proportional allocation for all numbers of pixels allocated.
Pooling the variances for each technique-algorithm combination over the
various numbers of pixels allocated gives the results in table 5-8.


TABLE 5-8.- POOLED VARIANCES FOR SEQUENTIAL ALLOCATION TECHNIQUES

                      Bayesian sequential allocation         Proportional allocation

                      CLASSY      AMOEBA      ISOCLS         CLASSY      AMOEBA      ISOCLS

Pooled variances     0.038699    0.079350    0.019605       0.036897    0.024976    0.033507
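
The pooling step can be illustrated in Python. The sketch assumes that, because every pixel-allocation level is based on the same number of segments, the pooled variance is simply the average of the per-level variances; that reading is an assumption, checked here only against the CLASSY entries of table 5-8. The function name is hypothetical.

```python
def pooled_variance(variances):
    """Pool equal-sized groups by averaging their variances (assumed rule)."""
    return sum(variances) / len(variances)


# Variance of the reduction measure at 30, 60, 90, and 120 pixels for CLASSY.
classy_bayesian = [0.04504710, 0.03661084, 0.03732508, 0.03581393]          # table 5-7
classy_proportional = [0.053946018, 0.023804247, 0.041802645, 0.028034508]  # table 5-5

print(round(pooled_variance(classy_bayesian), 6))      # 0.038699
print(round(pooled_variance(classy_proportional), 6))  # 0.036897
```

Both results agree with the CLASSY columns of table 5-8.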


Table 5-9 gives the least significant differences (LSD) for comparisons
between the two allocation techniques within the results for a given algorithm.
The LSD is computed as

$$ \mathrm{LSD} = t\,\sqrt{\frac{S_1^2 + S_2^2}{21}} \qquad (24) $$

where $S_1^2$ and $S_2^2$ are the pooled variance estimates of the groups to be
compared, 21 is the number of segments, and t is the 0.975 percentage point of
the Student's-t distribution with 80 degrees of freedom (t = 1.99).


TABLE 5-9.- LEAST SIGNIFICANT DIFFERENCES FOR COMPARISONS BETWEEN
BAYESIAN SEQUENTIAL AND PROPORTIONAL ALLOCATION TECHNIQUES
FOR STRATIFIED PROPORTION ESTIMATION

                      CLASSY      AMOEBA      ISOCLS

LSD in R values      0.119397    0.140262    0.100078
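
The LSD values in table 5-9 can be reproduced from the pooled variances in table 5-8. In the Python sketch below, dividing the sum of the two pooled variances by the number of segments (21) is an inference from the tabulated numbers rather than something stated explicitly in the surviving text; the function name and argument names are hypothetical.

```python
import math


def least_significant_difference(s1_sq, s2_sq, n_segments=21, t=1.99):
    """LSD = t * sqrt((S1^2 + S2^2) / n), with n assumed to be the 21 segments."""
    return t * math.sqrt((s1_sq + s2_sq) / n_segments)


# (Bayesian sequential, proportional) pooled variances from table 5-8.
POOLED = {
    "CLASSY": (0.038699, 0.036897),
    "AMOEBA": (0.079350, 0.024976),
    "ISOCLS": (0.019605, 0.033507),
}

for algorithm, (bayes_var, prop_var) in POOLED.items():
    print(f"{algorithm}: LSD = {least_significant_difference(bayes_var, prop_var):.4f}")
```

The computed values agree with table 5-9 to about four decimal places.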
The differences between the corresponding R values for tables 5-5 and 5-7 are
given in table 5-10. 


TABLE 5-10.- VALUES FOR R(proportional) - R(Bayes sequential)

Pixels        CLASSY           AMOEBA           ISOCLS

  30         0.200682 (a)    -0.140867 (a)
  60          .119384 (b)     -.086566
  90          .168540 (a)     -.066167          0.182187 (a)
 120          .116789 (b)     -.075886           .096402 (b)

(a) Significant at the 0.05-percent level.

(b) Marginally significant at the 0.05-percent level.

An examination of tables 5-9 and 5-10 shows that the CLASSY results for each
number of pixels and the ISOCLS results for 90 and 120 pixels are either
significant or very nearly significant at the 0.05-percent level. ISOCLS
results are not available for 30 and 60 pixels because more than 60 pixels had
already been allocated by the two-pixel per cluster initial allocation in the
Bayesian sequential procedure. The AMOEBA results for the Bayesian procedure
are consistently higher than for the proportional allocation procedure, and in
the case of 30 pixels allocated, the reduction in mean-square-error value was
significantly higher.
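
The comparison described above amounts to checking each tabulated difference against the LSD for its algorithm. The following Python sketch uses the differences as printed in table 5-10 and the LSD values from table 5-9; the helper name and the dictionary layout are choices made for this illustration.

```python
LSD = {"CLASSY": 0.119397, "AMOEBA": 0.140262, "ISOCLS": 0.100078}  # table 5-9

# R(proportional) - R(Bayes sequential), as printed in table 5-10.
DIFFERENCES = {
    ("CLASSY", 30): 0.200682, ("CLASSY", 60): 0.119384,
    ("CLASSY", 90): 0.168540, ("CLASSY", 120): 0.116789,
    ("AMOEBA", 30): -0.140867, ("AMOEBA", 60): -0.086566,
    ("AMOEBA", 90): -0.066167, ("AMOEBA", 120): -0.075886,
    ("ISOCLS", 90): 0.182187, ("ISOCLS", 120): 0.096402,
}


def exceeds_lsd(difference, lsd):
    """True when the absolute difference is larger than the LSD."""
    return abs(difference) > lsd


for (algorithm, pixels), diff in DIFFERENCES.items():
    ratio = abs(diff) / LSD[algorithm]
    flag = "exceeds LSD" if exceeds_lsd(diff, LSD[algorithm]) else "below LSD"
    print(f"{algorithm:7s} {pixels:3d} pixels: {diff:+.6f}  |diff|/LSD = {ratio:.2f}  {flag}")
```

A positive difference favors Bayesian sequential allocation, and a ratio just under 1 corresponds to the "marginally significant" entries.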
6. CONCLUSIONS AND RECOMMENDATIONS


The clustering algorithms CLASSY, AMOEBA, and ISOCLS performed comparably with
respect to the PCC using majority-rule labeling and the R measures. The fact
that the average results for all three algorithms were so similar and that the
average R value for Procedure 1 has been reported in several independent
studies to be about this same value (0.65 - 0.70) suggests there is a funda-
mental limitation in the separability of the data which precludes better per-
formance. This idea should be tested further in later studies. The fact that
CLASSY had, on the average, only about 9 clusters, whereas AMOEBA had about
17, and ISOCLS had almost 37 is seen as important. Given the same overall
level of performance, an economy in the number of clusters produced is to be
preferred.

The cluster-labeling techniques appear to suffer from the same fate. The pro- 
portion estimates obtained using these techniques were generally biased; the 
R-values were always greater than 0.9 and typically they were greater than 1. 
This poor performance for all of the clustering algorithms indicates that 
clusters were simply not pure enough for cluster labeling to function effi- 
ciently as a proportion-estimation technique. For all three clustering algo- 
rithms, the average PCC value, which may be thought of as a measure of cluster 
purity, was about 0.85. Apparently, much greater cluster purity is needed for 
cluster labeling to be a viable approach. 

The stratified proportion-estimation techniques generally worked well. The 
sequential allocation approach with no prior distribution on cluster purities 
produced good results for an allocation of 30 pixels; however, the results for 
allocations of 60, 90, and 120 pixels were biased and had much larger reduction 
in mean-square error values for all of the clustering algorithms. In addi- 
tion, these results were obtained with an initial allocation of three pixels 
per cluster, which means that in many cases, sequential allocation did not
begin until more than 30 pixels had been allocated.

The study eventually focused on a comparison of the Bayesian sequential allo- 
cation technique and the proportional allocation technique for stratified 
proportion estimation. Both of these techniques are unbiased. The propor-
tional allocation technique has an R value of about 0.67 which does not differ
significantly from algorithm to algorithm or for different numbers of pixels
allocated. This result is also not much different from the Procedure 1 value.
However, the Bayesian sequential allocation technique, when used with the
CLASSY or ISOCLS clustering algorithm, has significantly lower reduction in
mean-square-error values than does proportional allocation. The fact that
CLASSY has many fewer clusters than ISOCLS and, thus, is able to begin allo-
cating sequentially at a much lower number of dots makes it the preferred
algorithm.

The recommendation of this report is that studies be undertaken to determine 
how best to implement stratified proportion estimation using CLASSY clusters 
as the strata and the Bayesian sequential technique for pixel allocation. It 
appears that a total allocation of 30 pixels would achieve the minimum R. The 
average mean-square error for this number of pixels is 0.002853, which com- 
pares very favorably with the average variance of 0.002515 calculated from 
the results of the Procedure 1 secondary error analysis study (ref. 3). This 
variance for Procedure 1 was obtained with about 100 labeled pixels for each 
estimate (≈ 40 type 1 pixels plus ≈ 60 type 2 pixels). Thus, an allocation
of only 30 total dots represents a very clear advantage for the proposed 
replacement procedure for Procedure 1. 
APPENDIX

CALCULATION RESULTS OF THE AVERAGE BIAS IN THE PROPORTION ESTIMATE,
THE MEAN-SQUARE ERROR OF THE ESTIMATE, AND THE VARIANCE REDUCTION
FACTOR AS COMPARED TO SIMPLE RANDOM SAMPLING